Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feature/semantic_text] Register semantic text sub fields in the mapping #106560

Merged
merged 16 commits into from
Mar 22, 2024

Conversation

jimczi
Copy link
Contributor

@jimczi jimczi commented Mar 20, 2024

This PR refactors the semantic text field mapper to register its sub fields in the mapping instead of re-creating them each time when parsing documents.
It also fixes the generation of these fields in case the semantic text field is defined in an object field.
Lastly this change adds a new section called model_settings in the field parameter that is updated by the field mapper when inference results are received from a bulk action. The model settings are available in the fields as soon as the first document with the inference field is ingested and they are used to validate that updates. They are used to ensure consistency between what's used in the bulk action and what's defined in the field mapping.

Note: This PR is opened against the feature branch: feature/semantic_text

@jimczi jimczi requested review from carlosdelest and Mikep86 March 20, 2024 15:30
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Mar 20, 2024
Copy link
Member

@carlosdelest carlosdelest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I'd like to check if we can avoid serializing inference_id on model settings for the field mapper, to avoid duplicating that information in the mapping.

Copy link
Contributor

@Mikep86 Mikep86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments, none that I think block merging though

for (XContentParser.Token token = parser.nextToken(); token != XContentParser.Token.END_OBJECT; token = parser.nextToken()) {
switch (parser.currentName()) {
case RESULTS -> parseResultsList(xContentLocation, parser, context, nestedMapper);
default -> throw new DocumentParsingException(xContentLocation, "Unknown field name " + parser.currentName());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We originally skipped unknown field names to be forward-compatible with any new fields added in the future. Any concerns about that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR preserves this behaviour but only for the _inference.$field_name.results object. I am not a fan of this leniency but it is needed since we'll want to extend the information we provide in the future. I thought that we can be strict on the top level fields here since they should not change.

MappedFieldType mappedFieldType,
CopyTo copyTo,
IndexVersion indexVersionCreated,
IndexAnalyzers indexAnalyzers,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems a bit silly that we're defining the index analyzers, given that we're not indexing or storing the text field.

The point of this text field is to store the text that generated the chunk in _source, correct? What if we made it a keyword field instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep good call, I wonder if we need to explicitly map this field though. Maybe we can just avoid creating it in the mapping entirely?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want to index or analyze it in any way; users can use multifields / copy_to in case they want to search for it. I'd vote for skipping the mapping.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched to a keyword field instead so that it's clear that the field is there but not indexed.

@@ -55,25 +56,7 @@ setup:
index: test-index
id: doc_1
body:
non_inference_field: "you know, for testing"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests were meant to address the parsing of the inference results. I don't see this explicitly tested in other tests. Should we keep these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved these tests to InferenceMetadataFieldMapperTests since it is not possible anymore to provide the _inference field manually in a bulk request.

*/
public class SemanticTextFieldMapper extends FieldMapper {
private static final Logger logger = LogManager.getLogger(SemanticTextFieldMapper.class);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're not using the logger anymore I think

+ TASK_TYPE_FIELD.getPreferredName()
+ "], expected "
+ TEXT_EMBEDDING
+ "or "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

String formatting error: There's no space between "TEXT_EMBEDDING" and "or"

@jimczi jimczi merged commit d4e283d into elastic:feature/semantic-text Mar 22, 2024
11 of 14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs:triage Requires assignment of a team area label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants